Hyndman, R. J., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with exponential smoothing: The state space approach. Springer-Verlag.
Note
This notebook is not original material, but rather based on ref. 1 with notes expanding on some of the concepts contained in the book.
Motivation
So far we have seen Simple Exponential Smoothing, a forecasting model that lies between the Mean and the Naive models:
The Naive model assigns a constant value to all future forecasts: the last observation. That is, it assigns all the weight to the latest observation.
The Mean model assigns a constant value to all future forecasts, with equal weight for each observation regardless of when it occurred.
Simple Exponential Smoothing also assigns a constant value to all future forecasts, but this value is a weighted average that assigns the biggest weights to the most recent observations. The further in the past an observation lies, the smaller its assigned weight.
We now want to introduce trend and seasonality in exponential smoothing.
Let us first recall the component form of simple exponential smoothing:
\[
\begin{align*}
\text{Forecast equation} && \hat{y}_{t+h|t} &= \ell_{t} \\
\text{Smoothing equation} && \ell_{t} &= \alpha y_{t} + (1 - \alpha)\ell_{t-1}
\end{align*}
\]
Holt extended Simple Exponential Smoothing to allow the forecasting of data with trend.
Component form of the equations
Unlike for Simple Exponential Smoothing, we will not derive the equations. We will accept them and then interpret them further down below. For those interested, see reference 3.
Differences with Simple Exponential Smoothing:
It introduces a new component for the trend: \(b_t\) (analogous to the level \(l_t\)).
This implies introducing a new equation for the trend.
The trend has its own smoothing parameter \(\beta^*\) (analogous to \(\alpha\)).
\(l_t\) denotes an estimate of the level of the series at time \(t\).
\(b_t\) denotes an estimate of the trend (slope) of the series at time \(t\).
\(\alpha\) is the smoothing parameter for the level, \(0 \le \alpha \le 1\).
\(\beta^*\) is the smoothing parameter for the trend, \(0 \le \beta^* \le 1\).
The reason for using \(\beta^*\) instead of \(\beta\) will be explained when talking about innovations state space models.
Equations for the fitted values
To understand how these equations help fit a time series with trend, let us particularize the forecast equation for \(h=1\) (fitted values are one-step-ahead forecasts on the training data). We then shift the time index from \(t+1\) to \(t\) to make it consistent with the level and trend equations:
\[
\begin{align*}
\text{Forecast equation for h = 1} && \hat{y}_{t+1|t} &= \ell_{t} + 1 \cdot b_{t} \\
\text{Particularize for t instead t+1} && \hat{y}_{t|t-1} &= \ell_{t-1} + 1 \cdot b_{t-1} && \text{(fitted value at t)} \\
\end{align*}
\]
Therefore, the equations for the fitted values are:
\[
\begin{align*}
\text{Fitted value at t}&& \hat{y}_{t|t-1} &= \ell_{t-1} + b_{t-1} \\
\text{Level at t} && \ell_{t} &= \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\
\text{Trend at t} && b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)b_{t-1}
\end{align*}
\]
These equations allow us to determine the fitted values if given:
The values of the smoothing parameters \(\alpha\), \(\beta^*\).
An initial value for the level \(l_{0}\).
An initial value for the trend \(b_{0}\).
The process to compute the fitted values given this data is as follows:
For \(t = 1\): compute the fitted value \(\hat{y}_{1|0} = l_{0} + b_{0}\), then the level \(l_1\) and the trend \(b_1\) using the level and trend equations.
For \(t = 2\): we proceed in the same manner after having computed \(l_1\) and \(b_1\).
For \(t = 3\): we proceed in the same manner after having computed \(l_2\) and \(b_2\).
…
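The recursive process above can also be expressed in code. Below is a minimal sketch in Python (the function `holt_fit` and its interface are my own for illustration, not part of any library):

```python
def holt_fit(y, alpha, beta_star, l0, b0):
    """Run the recursions of Holt's linear method over a series y.

    Returns the levels, trends and one-step-ahead fitted values.
    level[0] and trend[0] hold the initial values l_0 and b_0;
    y[0] plays the role of y_1 in the notebook's notation.
    """
    level, trend, fitted = [l0], [b0], []
    for obs in y:
        l_prev, b_prev = level[-1], trend[-1]
        fitted.append(l_prev + b_prev)  # fitted value: l_{t-1} + b_{t-1}
        # Level equation: weighted average of observation and fitted value
        l_new = alpha * obs + (1 - alpha) * (l_prev + b_prev)
        # Trend equation: weighted average of new slope estimate and old trend
        b_new = beta_star * (l_new - l_prev) + (1 - beta_star) * b_prev
        level.append(l_new)
        trend.append(b_new)
    return level, trend, fitted
```

As a sanity check, with \(\alpha = 1\) and \(\beta^* = 0\) the level tracks the observations exactly and the trend stays frozen at \(b_0\), so each fitted value is simply the previous observation plus \(b_0\).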
The figure below helps picture the process:
Interpreting the equations
Now we are ready to make an interpretation of the equations.
Level equation
The level equation shows that \(l_t\) is a weighted average of the observation \(y_t\) and the one-step-ahead training forecast for time \(t\). This weighted average is determined by the parameter \(\alpha\). That is, \(\alpha\) is the smoothing parameter for the level.
NOTE: we can see that \(\ell_{t-1} + b_{t-1}\) is the one-step-ahead forecast at time \(t\) by substituting \(h = 1\) in the forecast equation and then particularizing it at time \(t\) instead of \(t+1\):
\[
\begin{align*}
\text{Forecast equation for h = 1} && \hat{y}_{t+1|t} &= \ell_{t} + 1 \cdot b_{t} \\
\text{Particularize for t instead t+1} && \hat{y}_{t|t-1} &= \ell_{t-1} + 1 \cdot b_{t-1} \\
\end{align*}
\]
As you can see, this shows that \(\ell_{t-1} + b_{t-1}\) is the one-step-ahead forecast at time \(t\).
Trend equation
The trend equation shows that \(b_t\) is a weighted average of the estimated trend at time \(t\) and the previous estimate of the trend. This weighted average is determined by the parameter \(\beta^*\).
NOTE: \(\ell_{t} - \ell_{t-1}\) is an estimate of the trend at time \(t\) in terms of \(l_t\) and \(l_{t-1}\). That is, only in terms of the information up to \(t\) (not beyond \(t\)).
If we take the level \(l_t\) as an estimate of the value of the time series, then the difference between consecutive levels divided by 1 (the distance in timesteps between \(l_{t-1}\) and \(l_t\)) is an estimate of the slope at time \(t\). The figure below should clarify this:
Equation for the fitted values
The fitted value at time \(t\) is simply the previous level plus the previous slope. That is, starting at \(l_{t-1}\) and advancing one timestep with slope \(b_{t-1}\), the fitted value at \(t\) is obtained.
\[
\begin{align*}
\text{Fitted value at t}&& \hat{y}_{t|t-1} &= \ell_{t-1} + 1 \cdot b_{t-1} \\
\end{align*}
\]
Equation for the forecasts
Once the fitted values \(\hat{y}_t\), the levels \(l_t\) and the slopes \(b_t\) have been computed for every value of \(t\) in the training region (from \(t=1\) to \(t=T\)), the forecasts are given by:
\[
\begin{align*}
\text{Forecast equation} && \hat{y}_{T+h|T} &= \ell_{T} + h \cdot b_{T}
\end{align*}
\]
By design, the forecast function is no longer flat but trending with a constant slope \(b_T\) (the last estimate of the trend).
The h-step-ahead forecast is equal to the last estimated level \(l_T\) plus \(h\) times the last estimated trend value \(b_T\). As such, the forecasts are a linear function of \(h\).
Effect of the value of \(\beta^*\)
A value of the parameter \(\beta^*\) that is too high can result in a trend component that is too responsive, leading to fitted values that follow the data too closely. Following all the small fluctuations in the data leads to overfitting.
The following graph shows a time series (in red) and the corresponding fitted values for \(\alpha=0.4\) and \(\beta^*\) ranging from 0 to 1.
Fitting process
Fitting Holt’s method to a time series means choosing the values of \(\alpha\), \(\beta^*\), \(l_0\) and \(b_0\) that minimize the Sum of Squared Errors (SSE). In this case, we are not going to implement this from scratch as we did for simple exponential smoothing. An understanding of the process is sufficient.
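To make the idea concrete, here is a hedged Python sketch of SSE minimization via a coarse grid search over \(\alpha\) and \(\beta^*\). This is an illustration only: fable uses a proper numerical optimizer, and the start-up heuristic below for \(l_0\) and \(b_0\) (taken from the first two observations) is my own simplification.

```python
import itertools

def holt_sse(y, alpha, beta_star, l0, b0):
    """Sum of squared one-step-ahead errors of Holt's method on y."""
    l, b, sse = l0, b0, 0.0
    for obs in y:
        sse += (obs - (l + b)) ** 2                        # error of the fitted value
        l_new = alpha * obs + (1 - alpha) * (l + b)        # level equation
        b = beta_star * (l_new - l) + (1 - beta_star) * b  # trend equation
        l = l_new
    return sse

def fit_holt_grid(y, step=0.05):
    """Pick (alpha, beta*) minimizing the SSE on a coarse grid.

    l_0 and b_0 are fixed from the first two observations -- a crude
    start-up heuristic rather than jointly optimized parameters."""
    l0, b0 = y[0], y[1] - y[0]
    grid = [i * step for i in range(1, round(1 / step))]
    best = min(itertools.product(grid, grid),
               key=lambda p: holt_sse(y, p[0], p[1], l0, b0))
    return best, holt_sse(y, best[0], best[1], l0, b0)
```

On a strongly trended series the search favours a responsive level and a stable trend, exactly the behaviour discussed in the previous section.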
As an example, the table below shows the result of this process assuming the estimated values \(\hat{\alpha} = 0.9999\), \(\hat{\beta^*} = 0.3267\), \(l_0 = 10.05\) and \(b_0 = 0.22\). The table also extends ten steps beyond the training data, generating the corresponding forecasts based on the last fitted values of the level, the slope and the smoothing parameters:
| Year | Time \(t\) | Observation \(y_t\) | Level \(l_t\) | Slope \(b_t\) | Forecast \(\hat{y}_{t+1\vert t}\) |
|------|------------|---------------------|---------------|---------------|-----------------------------------|
| 1959 | 0 | | 10.05 | 0.22 | |
| 1960 | 1 | 10.28 | 10.28 | 0.22 | 10.28 |
| 1961 | 2 | 10.48 | 10.48 | 0.22 | 10.50 |
| 1962 | 3 | 10.74 | 10.74 | 0.23 | 10.70 |
| 1963 | 4 | 10.95 | 10.95 | 0.22 | 10.97 |
| 1964 | 5 | 11.17 | 11.17 | 0.22 | 11.17 |
| 1965 | 6 | 11.39 | 11.39 | 0.22 | 11.39 |
| 1966 | 7 | 11.65 | 11.65 | 0.23 | 11.61 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| 2014 | 55 | 23.50 | 23.50 | 0.37 | 23.52 |
| 2015 | 56 | 23.85 | 23.85 | 0.36 | 23.87 |
| 2016 | 57 | 24.21 | 24.21 | 0.36 | 24.21 |
| 2017 | 58 | 24.60 | 24.60 | 0.37 | 24.57 |

| Year | \(h\) | \(\hat{y}_{T+h\vert T}\) |
|------|-------|--------------------------|
| 2018 | 1 | 24.97 |
| 2019 | 2 | 25.34 |
| 2020 | 3 | 25.71 |
| 2021 | 4 | 26.07 |
| 2022 | 5 | 26.44 |
| 2023 | 6 | 26.81 |
| 2024 | 7 | 27.18 |
| 2025 | 8 | 27.55 |
| 2026 | 9 | 27.92 |
| 2027 | 10 | 28.29 |
Exercise: reproduce this table in Excel (template given by the professor).
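As a cross-check of the first table rows, the following Python sketch runs the level and trend recursions with the parameter values stated above (\(\hat{\alpha} = 0.9999\), \(\hat{\beta^*} = 0.3267\), \(l_0 = 10.05\), \(b_0 = 0.22\)) and the first observations, reproducing the table values up to rounding:

```python
# Estimated parameters and initial values from the table
alpha, beta_star = 0.9999, 0.3267
l, b = 10.05, 0.22                           # l_0 and b_0
observations = [10.28, 10.48, 10.74, 10.95]  # years 1960-1963

rows = []
for y in observations:
    fitted = l + b                      # forecast of this year made a year earlier
    l_new = alpha * y + (1 - alpha) * (l + b)          # level equation
    b = beta_star * (l_new - l) + (1 - beta_star) * b  # trend equation
    l = l_new
    rows.append((round(l, 2), round(b, 2), round(fitted, 2)))

for row in rows:
    print(row)  # (level, slope, fitted value)
```

Because \(\hat{\alpha}\) is almost 1, the level essentially tracks the observations, which is why the Level column equals the Observation column in the table.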
R Example: Australian Population
Australia’s population from 1960 to 2017.
aus_economy <- global_economy %>%
  filter(Code == "AUS") %>%
  mutate(Pop = Population / 1e6)

autoplot(aus_economy, Pop) +
  labs(y = "Millions", title = "Australian population")
Clearly, this is trended data. Trended exponential smoothing is therefore sensible.
The fable library performs all the fitting automatically using the following commands:
fit <- aus_economy %>%
  model(AAN = ETS(Pop ~ error("A") + trend("A") + season("N")))
Remember that the output is a mable, or model table. In other words, a table object containing the model objects we fitted (in this case only one model):
fit
# A mable: 1 x 2
# Key: Country [1]
Country AAN
<fct> <model>
1 Australia <ETS(A,A,N)>
As always, we can check the values of the parameters using the function tidy():
tidy(fit)
# A tibble: 4 × 4
Country .model term estimate
<fct> <chr> <chr> <dbl>
1 Australia AAN alpha 1.00
2 Australia AAN beta 0.327
3 Australia AAN l[0] 10.1
4 Australia AAN b[0] 0.222
We may also extract the fitted values and other information using augment(). In the example below I also compute the sum of squared residuals using augment():
fitted_vals <- fit %>% augment()
resid <- fitted_vals$.innov

# Sum of squared residuals
SSE <- sum(resid^2)
SSE
[1] 0.2231852
We may generate forecasts as usual using forecast() on the fitted object:
fc <- fit %>% forecast(h = 10)

fc %>%
  autoplot(aus_economy, level = NULL) +
  labs(title = "Australian population", y = "Millions") +
  guides(colour = guide_legend(title = "Forecast"))
As expected, the h-step ahead forecast is equal to the last estimated level plus h times the last estimated trend value. As such, the forecasts are a linear function of h.
Components of the model
The way in which exponential smoothing proceeds may be considered as a decomposition of the time series into multiple components; in the trended case, into a level \(l_t\) and a slope \(b_t\).
We may use the function components() on an ETS model to retrieve the values of these components in the training region. In the output below, note that we get an additional year, 1959, corresponding to \(t=0\), where we have \(l_0\) and \(b_0\).
holts_components <- fit %>% components()
holts_components
# A dable: 59 x 7 [1Y]
# Key: Country, .model [1]
# : Pop = lag(level, 1) + lag(slope, 1) + remainder
Country .model Year Pop level slope remainder
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Australia AAN 1959 NA 10.1 0.222 NA
2 Australia AAN 1960 10.3 10.3 0.222 -0.000145
3 Australia AAN 1961 10.5 10.5 0.217 -0.0159
4 Australia AAN 1962 10.7 10.7 0.231 0.0418
5 Australia AAN 1963 11.0 11.0 0.223 -0.0229
6 Australia AAN 1964 11.2 11.2 0.221 -0.00641
7 Australia AAN 1965 11.4 11.4 0.221 -0.000314
8 Australia AAN 1966 11.7 11.7 0.235 0.0418
9 Australia AAN 1967 11.8 11.8 0.206 -0.0869
10 Australia AAN 1968 12.0 12.0 0.208 0.00350
# ℹ 49 more rows
# Depict the components
holts_components %>% autoplot()
Just as we did in the separate Excel files, the fitted values may be computed from the level and the slope returned by components(), using the forecast equation particularized at \(h=1\) and time \(t\) (equation derived previously in this notebook):
\[
\begin{align*}
\text{Fitted value at t}&& \hat{y}_{t|t-1} &= \ell_{t-1} + b_{t-1}
\end{align*}
\]
# Use the equation for the fitted values
holts_components <- holts_components %>%
  mutate(
    .fitted = lag(level) + lag(slope)  # Equation for the fitted vals
  )
We may check that these are the same fitted values returned when using augment() on the model:
# Vector of fitted values computed using augment()
fitted_augment <- fit %>%
  augment() %>%
  pull(.fitted)

# Vector of fitted values computed using components() and the equation
fitted_components <- holts_components %>% pull(.fitted)

# Remove NAs
fitted_components <- fitted_components[!is.na(fitted_components)]

all.equal(fitted_augment, fitted_components)
[1] TRUE
Damped trend methods
Holt’s linear method \(\rightarrow\) constant trend
Problematic for longer forecast horizons (forecasts too high)
Damping parameter \(\rightarrow\) dampens the trend line to a flat line at some point in the future.
Very popular method, proven to be quite successful
Additional parameter on top of those included in Holt’s method \(0<\phi<1\)
For \(\phi = 1\) the method is equivalent to Holt’s method
For \(0 < \phi < 1\) the trend is dampened and the forecasts approach a constant value some time into the future. Specifically the value is:
\(\ell_T+\phi b_T/(1-\phi)\) as \(h \rightarrow \infty\)
The reason for the above value is that the damping term is actually a geometric progression of ratio \(\phi\):
\[
\hat{y}_{T+h|T} = \ell_{T} + (\phi + \phi^2 + \dots + \phi^{h})\, b_{T}
\]
Since the ratio \(\phi\) is smaller than 1, the geometric progression converges and its sum amounts to \(\ell_T+\phi b_T/(1-\phi)\) as \(h \rightarrow \infty\)
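The damped forecasts and their limit can be illustrated with a short Python sketch (the function name is my own, chosen for illustration):

```python
def damped_forecasts(l_T, b_T, phi, h_max):
    """Forecasts of the damped trend method:
    y-hat_{T+h|T} = l_T + (phi + phi^2 + ... + phi^h) * b_T."""
    forecasts, damp_sum = [], 0.0
    for h in range(1, h_max + 1):
        damp_sum += phi ** h  # partial sum of the geometric progression
        forecasts.append(l_T + damp_sum * b_T)
    return forecasts
```

With \(\phi = 1\) the sum is simply \(h\), recovering Holt's linear forecasts; with \(\phi < 1\) the forecasts level off toward \(\ell_T + \phi b_T/(1-\phi)\). For example, for \(l_T = 10\), \(b_T = 1\) and \(\phi = 0.9\) the forecasts approach \(10 + 0.9/0.1 = 19\).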
Usual values for \(\phi\)
Rarely less than 0.8 (very strong effect for smaller values)
Rarely greater than 0.98 (otherwise very similar to undamped model)
You can see that the damped method returns a lower forecast for every point, as expected.
Damped example 2: specify a range of possible values for \(\phi\) and let ETS() pick the best
Specifying a grid of values for the parameter \(\phi\): we can specify a range of values and let the fpp3 library try multiple values until it finds the best fit:
fit_damped_2 <- aus_economy %>%
  model(
    holts = ETS(Pop ~ error("A") + trend("A") + season("N")),
    # Specify a grid of values c(0.8, 0.95) for phi
    damped_holts = ETS(Pop ~ error("A") +
                         trend("Ad", phi = NULL, phi_range = c(0.8, 0.95)) +
                         season("N"))
  )
tidy(fit_damped_2)
# A tibble: 9 × 4
Country .model term estimate
<fct> <chr> <chr> <dbl>
1 Australia holts alpha 1.00
2 Australia holts beta 0.327
3 Australia holts l[0] 10.1
4 Australia holts b[0] 0.222
5 Australia damped_holts alpha 0.533
6 Australia damped_holts beta 0.533
7 Australia damped_holts phi 0.950
8 Australia damped_holts l[0] 9.96
9 Australia damped_holts b[0] 0.207
Exercise 1: Internet usage
Number of users connected to the internet via a server. Data observed over 100 minutes. Type WWWusage to find out more details about the dataset (loaded along with the fpp3 package).
# Convert time series object to tsibble object for compatibility
# with fable methods and tools
www_usage <- as_tsibble(WWWusage)

# Plot the data
www_usage %>%
  autoplot(value) +
  labs(x = "Minute", y = "Number of users",
       title = "Internet usage per minute")
1. Fit models
Fit the following models:
A Simple exponential smoothing model
A Holt’s additive model (exponential smoothing with additive trend)
A damped Holt’s additive model (exponential smoothing with additive damped trend)
2. Briefly compare residuals of Holt’s model and damped Holt’s model
3. Perform forecasts
Use the three models and depict the forecasts. Create two separate graphs:
Graph 1: forecasts of all models along with the historical data. Do not print confidence intervals.
Graph 2: forecasts of the damped model with confidence intervals
Exercise 2: Internet usage
This exercise is somewhat of a repetition of exercise 1, only this time using cross validation.
Fit the following models:
A Simple exponential smoothing model
A Holt’s additive model (exponential smoothing with additive trend)
A damped Holt’s additive model (exponential smoothing with additive damped trend)
# Convert time series object to tsibble object for compatibility
# with fable methods and tools
www_usage <- as_tsibble(WWWusage)

# Plot the data
www_usage %>%
  autoplot(value) +
  labs(x = "Minute", y = "Number of users",
       title = "Internet usage per minute")
Use cross-validation to assess which method works best for the data at hand.
Consider forecasts up to a horizon of 10.
The smallest training dataset shall cover 80% of the observations.
Compute the accuracy metrics both for each forecast horizon and model as well as for each model averaged over all forecast horizons.
When you are done, take your best model on average over all forecast horizons and:
Assess its residuals
Forecast 10 timesteps ahead and depict the forecasts along with the prediction intervals and the historical data.
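Conceptually, the cross-validation loop that stretch_tsibble() and accuracy() implement can be sketched as follows. This is a hedged Python illustration with a naive stand-in forecaster; all names are my own rather than fable's:

```python
def cv_errors_by_horizon(y, forecaster, init, h_max):
    """Rolling-origin cross-validation.

    For each expanding training window (the first window has `init`
    observations, then one more each time, like stretch_tsibble()),
    forecast h = 1..h_max steps ahead and collect the absolute errors.
    Returns the mean absolute error per forecast horizon."""
    errors = {h: [] for h in range(1, h_max + 1)}
    for n_train in range(init, len(y)):
        forecasts = forecaster(y[:n_train], h_max)
        for h in range(1, h_max + 1):
            t = n_train + h - 1  # index of the actual value y_{n_train + h}
            if t < len(y):
                errors[h].append(abs(y[t] - forecasts[h - 1]))
    return {h: sum(errs) / len(errs) for h, errs in errors.items() if errs}

def naive_forecaster(train, h_max):
    """Stand-in model: repeat the last observation h_max times."""
    return [train[-1]] * h_max
```

On a perfectly linear series the naive model's mean absolute error grows linearly with the horizon, which is exactly the kind of per-horizon comparison the exercise asks for.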
Step 1: create training datasets
Create the different datasets used for cross validation using stretch_tsibble(). Smallest dataset set to 10 observations.
Step 2: fit models to training datasets
Fit the models to be compared to the datasets used for cross-validation:
Step 3: Perform forecasts and compute accuracy metrics